Authors: Mauro Venticinque | Angelo Schillaci | Daniele Tambone
GitHub project: Bank-Marketing
Date: 2025-05-19
In this project, we analyze data from a Portuguese banking institution’s direct marketing campaigns to identify key factors influencing customer subscription to term deposits.
A deposit account is a bank account maintained by a financial institution in which a customer can deposit and withdraw money. Deposit accounts include savings accounts, current accounts, and several other account types.
The dataset includes client demographics, previous campaign interactions, and economic indicators. Our goal is to develop insights that will enhance the effectiveness of future marketing strategies. By applying supervised learning techniques, we aim to predict customer responses and optimize outreach efforts for better engagement and conversion rates.
The report will begin with an Exploratory Data Analysis, examining the variables and their relationship with the target attribute (subscribed) to identify the most influential factors.
- age (Integer): age of the customer
- job (Categorical): occupation
- marital (Categorical): marital status
- education (Categorical): education level
- default (Binary): has credit in default?
- housing (Binary): has housing loan?
- loan (Binary): has personal loan?
- contact (Categorical): contact communication type
- month (Categorical): last contact month of year
- day_of_week (Integer): last contact day of the week
- duration (Integer): last contact duration, in seconds. Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
- campaign (Integer): number of contacts performed during this campaign and for this client (includes last contact)
- pdays (Integer): number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted, as the summary statistics below confirm)
- previous (Integer): number of contacts performed before this campaign and for this client
- poutcome (Categorical): outcome of the previous marketing campaign ('failure', 'nonexistent', 'success')
- subscribed (Binary): has the client subscribed to a term deposit?

Source: UCI Machine Learning Repository

Note: our dataset does not include the bank balance variable.
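A minimal loading sketch (hedged: the filename and the ';' separator are assumptions based on the UCI release; here we write a two-row sample to a temporary file so the snippet is self-contained):

```r
# Write a tiny two-row sample so this sketch runs on its own;
# in the project, read.csv would point at the actual train file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("age;job;subscribed",
             "30;admin.;yes",
             "39;services;no"), tmp)
train <- read.csv(tmp, sep = ";", stringsAsFactors = FALSE)
str(train)          # real data: 32,950 obs. of 21 variables
sum(is.na(train))   # 0 -> no missing values
```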
| Name | train |
|---|---|
| Number of rows | 32950 |
| Number of columns | 21 |
| Column type frequency: | |
| character | 11 |
| numeric | 10 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 4 | 0 |
| education | 0 | 1 | 7 | 19 | 0 | 8 | 0 |
| default | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| housing | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| loan | 0 | 1 | 2 | 7 | 0 | 3 | 0 |
| contact | 0 | 1 | 8 | 9 | 0 | 2 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 10 | 0 |
| day_of_week | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
| poutcome | 0 | 1 | 7 | 11 | 0 | 3 | 0 |
| subscribed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.04 | 10.45 | 17.00 | 32.00 | 38.00 | 47.00 | 98.00 | ▅▇▃▁▁ |
| duration | 0 | 1 | 258.66 | 260.83 | 0.00 | 102.00 | 180.00 | 318.00 | 4918.00 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.57 | 2.77 | 1.00 | 1.00 | 2.00 | 3.00 | 43.00 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 961.90 | 188.33 | 0.00 | 999.00 | 999.00 | 999.00 | 999.00 | ▁▁▁▁▇ |
| previous | 0 | 1 | 0.17 | 0.49 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| emp.var.rate | 0 | 1 | 0.08 | 1.57 | -3.40 | -1.80 | 1.10 | 1.40 | 1.40 | ▁▃▁▁▇ |
| cons.price.idx | 0 | 1 | 93.57 | 0.58 | 92.20 | 93.08 | 93.75 | 93.99 | 94.77 | ▁▆▃▇▂ |
| cons.conf.idx | 0 | 1 | -40.49 | 4.63 | -50.80 | -42.70 | -41.80 | -36.40 | -26.90 | ▅▇▁▇▁ |
| euribor3m | 0 | 1 | 3.62 | 1.74 | 0.63 | 1.34 | 4.86 | 4.96 | 5.04 | ▅▁▁▁▇ |
| nr.employed | 0 | 1 | 5167.01 | 72.31 | 4963.60 | 5099.10 | 5191.00 | 5228.10 | 5228.10 | ▁▁▃▁▇ |
The dataset includes 21 variables and 32,950 rows, with no
missing values.
Categorical variables like job and
education show good diversity, while
default, loan, and
housing have only 3 unique values.
Among numeric variables, age has a fairly normal
distribution (mean ≈ 40, sd ≈ 10), while
duration and pdays are highly skewed,
with extreme values up to 4918 and 999 respectively.
Some variables (e.g., campaign,
previous) have a low median but long tails, indicating
that most observations are clustered at low values.
Macroeconomic variables such as emp.var.rate,
euribor3m, and nr.employed are more
stable, with tight interquartile ranges, suggesting consistent economic
conditions during data collection.
First, we note that this dataset is unbalanced: the majority of clients did not subscribe.
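The imbalance can be quantified with a simple frequency table; the vector below is a toy stand-in for the actual subscribed column:

```r
# Toy stand-in for train$subscribed; the real column is
# similarly dominated by "no".
subscribed <- c(rep("no", 9), "yes")
prop.table(table(subscribed))   # class proportions
```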
Correlation Matrix
The correlation matrix reveals clear patterns among the numerical variables. Notably, euribor3m, nr.employed, and emp.var.rate are strongly positively correlated with one another, suggesting that these variables capture similar information about the economic environment. This should be taken into account in predictive modeling, as using them together could lead to multicollinearity. In contrast, variables like campaign, pdays, and previous show very weak correlations with most other features, indicating they may contribute more independently to the model.
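A sketch of how such a matrix can be computed on the numeric columns only (toy data frame; in the report this runs on the full training set):

```r
# Toy numeric frame mimicking the macro variables' co-movement.
df <- data.frame(euribor3m   = c(4.9, 1.3, 4.8, 0.7, 5.0),
                 nr.employed = c(5228, 5099, 5191, 4964, 5228),
                 campaign    = c(1, 2, 3, 1, 2))
num_cols <- sapply(df, is.numeric)   # keep numeric columns only
round(cor(df[, num_cols]), 2)        # euribor3m and nr.employed move together
```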
Scatterplot Matrix by Target
Several variables, such as duration and pdays, show highly skewed distributions, which could influence model performance and may benefit from transformations (e.g., log or binning). While some variable pairs exhibit linear trends (e.g., euribor3m vs nr.employed), many scatterplots show dispersed or nonlinear patterns, suggesting that simple linear models may not fully capture the complexity in the data.
In certain plots, the blue points (subscribed) are concentrated in specific areas, showing the key factors that influenced successful subscriptions.
Box plot of age
Clients who refused tend to be younger; it is comparatively rare to see older clients say no.
Box plot of emp.var.rate
Text
Box plot of euribor3m
Text
Distribution of Age
The age distribution is right-skewed, with a peak around 30–40 years old. The proportion of subscribers is higher among those over 60, which may reflect greater financial stability in older age groups.
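The age effect can be checked by computing the subscription rate per age band (toy vectors; the report uses the actual age and subscribed columns):

```r
# Toy data chosen to mirror the pattern: a higher subscription rate over 60.
age        <- c(25, 34, 38, 45, 52, 63, 67, 71)
subscribed <- c( 0,  0,  1,  0,  0,  1,  1,  0)
band <- cut(age, breaks = c(0, 40, 60, Inf),
            labels = c("<=40", "41-60", ">60"))
tapply(subscribed, band, mean)   # subscription rate per band
```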
Distribution of Job
The distribution of occupations is not uniform: the largest group is admin. The number of subscribers among admin workers is among the highest of all occupations, possibly because admin workers tend to have higher incomes. Students and retired people, however, show the highest subscription rates, which is consistent with the earlier observation that older clients and those with higher education levels are more likely to subscribe.
Distribution of Education
Regarding education level, the distribution is not uniform: most clients hold a university degree. Clients with a university degree also show one of the highest subscription proportions of all education levels, probably because higher education is associated with higher income.
Distribution of Marital Status
Text.
Distribution of Contact
Text.
Distribution of Contacts
Regarding the previous campaign, most clients were not previously contacted, yet the success rate is visibly higher among those who were contacted more than once or had a successful prior outcome. This suggests that prior engagement is positively associated with subscription, although such clients represent a small share of the sample.
Distribution of Days of Week
The distribution of the last contact day of the week is roughly uniform, with a slight peak on Thursday. The subscription proportion is highest when the last contact falls in the middle of the week.
Distribution of Months
By contrast, the distribution of the last contact month of the year is not uniform: most clients were contacted in May. The subscription proportion is highest when the last contact occurred in March, September, October, or December, possibly because clients have more disposable money in those periods and are less receptive during the summer.
Distribution of Duration
The duration of the last contact is right-skewed, with a peak around 0–100 seconds. The subscription proportion is higher for longer calls, presumably because clients who stay on the phone longer are more interested in subscribing.
The Exploratory Data Analysis reveals several important insights into the factors that influence the likelihood of subscription in this dataset. Below is a summary of the key findings:
In summary, the analysis suggests that financial conditions, previous campaign interactions, and macroeconomic indicators are strong predictors of subscription behavior. Demographic factors such as age, occupation, and education level also contribute meaningfully to the outcome.
In the next section, we will use these EDA findings to make a preliminary selection of the most influential variables, based on the visual trends observed in the plots.
With a view to training the model, we apply one-hot encoding.
Based on the Exploratory Data Analysis (EDA), we selected only the most relevant variables, according to the following patterns:

- Subscriptions are more likely in the months of September, October, December, and March.
- Clients contacted in a previous campaign (whether they subscribed or refused) are more likely to subscribe than those who were never contacted.
- cons.price.idx values greater than 93 are linked to a higher chance of subscription.
- cons.conf.idx values above the median are associated with a higher likelihood of subscription.
- euribor3m values below the mean correspond to a higher probability of subscription.
- emp.var.rate values below 0 are more likely to be associated with subscription.

We transform these continuous variables into binary indicators reflecting these insights.
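The indicator construction can be sketched as follows (hedged: the exact cut-off directions, e.g. low_cpi as values at or below 93, are inferred from the variable names and the patterns listed above; the two-row frame is a toy stand-in for the training set):

```r
# Toy two-row frame; in the project these columns come from the training set.
df <- data.frame(cons.price.idx = c(92.5, 93.9),
                 cons.conf.idx  = c(-45, -30),
                 euribor3m      = c(4.9, 1.2),
                 emp.var.rate   = c(1.4, -1.8))
df$low_cpi      <- as.numeric(df$cons.price.idx <= 93)   # assumed direction
df$high_cci     <- as.numeric(df$cons.conf.idx > median(df$cons.conf.idx))
df$low_euribor  <- as.numeric(df$euribor3m < mean(df$euribor3m))
df$negative_emp <- as.numeric(df$emp.var.rate < 0)
```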
We obtain the following dataset:
## 'data.frame': 32950 obs. of 19 variables:
## $ age : int 30 39 43 27 56 41 57 46 61 35 ...
## $ previous : int 1 0 0 0 1 0 0 0 1 0 ...
## $ negative_emp: num 1 1 0 1 1 0 0 1 1 1 ...
## $ low_cpi : num 1 0 0 1 0 0 0 1 1 1 ...
## $ high_cci : num 0 1 0 0 1 1 0 1 1 1 ...
## $ low_euribor : num 1 1 0 1 1 0 0 1 1 1 ...
## $ emp_cat : chr "Negative" "Negative" "Positive or Zero" "Negative" ...
## $ university : num 0 1 0 0 0 0 1 0 0 1 ...
## $ p_course : num 1 0 0 0 1 0 0 1 1 0 ...
## $ job_student : num 0 0 0 1 0 0 0 0 0 0 ...
## $ job_retired : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job_admin : num 0 0 0 0 0 0 0 0 0 1 ...
## $ month_sep : num 0 0 0 0 0 0 0 0 0 1 ...
## $ month_oct : num 0 0 0 0 0 0 0 0 1 0 ...
## $ month_dec : num 0 0 0 0 0 0 0 1 0 0 ...
## $ month_mar : num 0 0 0 1 0 0 0 0 0 0 ...
## $ p_failure : num 1 0 0 0 1 0 0 0 1 0 ...
## $ p_success : num 0 0 0 0 0 0 0 0 0 0 ...
## $ target : num 1 1 1 1 1 1 1 1 1 1 ...
# Full logistic model, then both-direction stepwise selection by AIC
# (stepAIC from MASS); vif from car checks for multicollinearity
full_model <- glm(target ~ ., data = full_df, family = binomial)
stepwise <- stepAIC(full_model, direction = "both", trace = FALSE)
vif(stepwise)
## previous negative_emp low_cpi high_cci low_euribor university
## 4.029297 4.686861 1.664500 1.320400 5.015548 1.225140
## p_course job_student job_retired job_admin month_sep month_oct
## 1.113935 1.072312 1.093949 1.193359 1.110719 1.072451
## month_dec month_mar p_failure p_success
## 1.050510 1.038411 2.887853 2.653632
# predictors removed by stepwise selection
stepwise$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## target ~ age + previous + negative_emp + low_cpi + high_cci +
## low_euribor + emp_cat + university + p_course + job_student +
## job_retired + job_admin + month_sep + month_oct + month_dec +
## month_mar + p_failure + p_success
##
## Final Model:
## target ~ previous + negative_emp + low_cpi + high_cci + low_euribor +
## university + p_course + job_student + job_retired + job_admin +
## month_sep + month_oct + month_dec + month_mar + p_failure +
## p_success
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 32932 18639.69 18675.69
## 2 - emp_cat 0 0.0000000 32932 18639.69 18675.69
## 3 - age 1 0.4008751 32933 18640.09 18674.09
df_no_target <- subset(full_df, select = -target)
# Lasso-penalised logistic regression (alpha = 1 selects the lasso penalty)
fit_lasso <- glmnet(x = as.matrix(df_no_target),
                    y = target,
                    alpha = 1,
                    family = "binomial")
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
cv_fit <- cv.glmnet(
x = as.matrix(df_no_target),
y = target,
alpha = 1,
family = "binomial"
)
## Warning in storage.mode(xd) <- "double": NAs introduced by coercion
## Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
plot(cv_fit)
# predictors selected by Lasso
coef(cv_fit, s = "lambda.1se")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -3.21439368
## age .
## previous .
## negative_emp 0.36233500
## low_cpi -0.05967353
## high_cci 0.48858934
## low_euribor 1.19013520
## emp_cat .
## university 0.01623647
## p_course .
## job_student 0.23133147
## job_retired 0.26855064
## job_admin .
## month_sep 0.24156433
## month_oct 0.53806790
## month_dec 0.17381524
## month_mar 0.93326810
## p_failure -0.04753069
## p_success 1.59751521
lasso_mod <- glm(target ~ negative_emp + low_cpi + high_cci + low_euribor + university +
                   job_student + job_retired + month_sep + month_oct + month_dec +
                   month_mar + p_failure + p_success,
                 data = full_df, family = binomial)
# Requires: caret (createFolds) and pROC (roc, auc)
k_fold_mod <- function(data, target_col, model_formula, k = 10) {
set.seed(123)
folds <- createFolds(data[[target_col]], k = k, list = TRUE, returnTrain = FALSE)
acc_best_vec <- numeric(k)
f1_best_vec <- numeric(k)
auc_vec <- numeric(k)
aic_vec <- numeric(k)
acc_thresh_vec <- numeric(k)
f1_thresh_vec <- numeric(k)
sensitivity_vec <- numeric(k)
specificity_vec <- numeric(k)
for (i in 1:k) {
test_idx <- folds[[i]]
train_fold <- data[-test_idx, ]
test_fold <- data[test_idx, ]
# Fit model
fitted_model <- glm(model_formula, data = train_fold, family = binomial)
pred_probs_train <- predict(fitted_model, newdata = train_fold, type = "response")
actual_train <- train_fold[[target_col]]
# Threshold search
thresholds <- seq(0.01, 0.99, by = 0.01)
acc_scores <- numeric(length(thresholds))
f1_scores <- numeric(length(thresholds))
for (j in seq_along(thresholds)) {
threshold <- thresholds[j]
preds <- ifelse(pred_probs_train > threshold, 1, 0)
acc_scores[j] <- mean(preds == actual_train)
cm <- table(Predicted = preds, Actual = actual_train)
precision <- ifelse("1" %in% rownames(cm) && "1" %in% colnames(cm),
cm["1", "1"] / sum(cm["1", ]), 0)
recall <- ifelse("1" %in% rownames(cm) && "1" %in% colnames(cm),
cm["1", "1"] / sum(cm[, "1"]), 0)
f1 <- ifelse((precision + recall) > 0,
2 * (precision * recall) / (precision + recall), 0)
f1_scores[j] <- f1
}
# Best thresholds
best_acc_threshold <- thresholds[which.max(acc_scores)]
best_f1_threshold <- thresholds[which.max(f1_scores)]
acc_thresh_vec[i] <- best_acc_threshold
f1_thresh_vec[i] <- best_f1_threshold
# Test predictions
pred_probs_test <- predict(fitted_model, newdata = test_fold, type = "response")
actual_test <- test_fold[[target_col]]
pred_acc <- ifelse(pred_probs_test > best_acc_threshold, 1, 0)
pred_f1 <- ifelse(pred_probs_test > best_f1_threshold, 1, 0)
acc_best_vec[i] <- mean(pred_acc == actual_test)
# Confusion matrix for F1 threshold
cm <- table(Predicted = pred_f1, Actual = actual_test)
tp <- ifelse("1" %in% rownames(cm) && "1" %in% colnames(cm), cm["1", "1"], 0)
tn <- ifelse("0" %in% rownames(cm) && "0" %in% colnames(cm), cm["0", "0"], 0)
fp <- ifelse("1" %in% rownames(cm) && "0" %in% colnames(cm), cm["1", "0"], 0)
fn <- ifelse("0" %in% rownames(cm) && "1" %in% colnames(cm), cm["0", "1"], 0)
precision <- ifelse((tp + fp) > 0, tp / (tp + fp), 0)
recall <- ifelse((tp + fn) > 0, tp / (tp + fn), 0)
f1_best_vec[i] <- ifelse((precision + recall) > 0,
2 * (precision * recall) / (precision + recall), 0)
sensitivity_vec[i] <- ifelse((tp + fn) > 0, tp / (tp + fn), NA)
specificity_vec[i] <- ifelse((tn + fp) > 0, tn / (tn + fp), NA)
auc_vec[i] <- tryCatch({
roc_obj <- roc(actual_test, pred_probs_test)
as.numeric(auc(roc_obj))
}, error = function(e) NA)
aic_vec[i] <- AIC(fitted_model)
}
return(list(
Accuracy_at_best_threshold = paste0(round(mean(acc_best_vec, na.rm = TRUE), 4),
" (threshold = ", round(mean(acc_thresh_vec, na.rm = TRUE), 2), ")"),
F1_at_best_threshold = paste0(round(mean(f1_best_vec, na.rm = TRUE), 4),
" (threshold = ", round(mean(f1_thresh_vec, na.rm = TRUE), 2), ")"),
Sensitivity = round(mean(sensitivity_vec, na.rm = TRUE), 4),
Specificity = round(mean(specificity_vec, na.rm = TRUE), 4),
AUC = round(mean(auc_vec, na.rm = TRUE), 4),
AIC = round(mean(aic_vec, na.rm = TRUE), 2)
))
}
evaluate_threshold <- function(probs, target, threshold) {
pred <- ifelse(probs > threshold, 1, 0)
cm <- table(Predicted = pred, Actual = target)
tp <- ifelse("1" %in% rownames(cm) && "1" %in% colnames(cm), cm["1", "1"], 0)
tn <- ifelse("0" %in% rownames(cm) && "0" %in% colnames(cm), cm["0", "0"], 0)
fp <- ifelse("1" %in% rownames(cm) && "0" %in% colnames(cm), cm["1", "0"], 0)
fn <- ifelse("0" %in% rownames(cm) && "1" %in% colnames(cm), cm["0", "1"], 0)
accuracy <- (tp + tn) / (tp + tn + fp + fn)
precision <- ifelse((tp + fp) > 0, tp / (tp + fp), 0)
recall <- ifelse((tp + fn) > 0, tp / (tp + fn), 0)
specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), 0)
f1 <- ifelse((precision + recall) > 0,
2 * (precision * recall) / (precision + recall), 0)
return(list(
Threshold = threshold,
Accuracy = round(accuracy, 4),
F1 = round(f1, 4),
Sensitivity = round(recall, 4),
Specificity = round(specificity, 4)
))
}
summary(stepwise)
##
## Call:
## glm(formula = target ~ previous + negative_emp + low_cpi + high_cci +
## low_euribor + university + p_course + job_student + job_retired +
## job_admin + month_sep + month_oct + month_dec + month_mar +
## p_failure + p_success, family = binomial, data = full_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.59442 0.04935 -72.834 < 2e-16 ***
## previous 0.12453 0.05844 2.131 0.03308 *
## negative_emp 0.70668 0.09106 7.761 8.43e-15 ***
## low_cpi -0.52384 0.05153 -10.166 < 2e-16 ***
## high_cci 0.65679 0.04461 14.724 < 2e-16 ***
## low_euribor 1.42568 0.09004 15.833 < 2e-16 ***
## university 0.17904 0.04515 3.965 7.33e-05 ***
## p_course 0.13697 0.06137 2.232 0.02563 *
## job_student 0.54569 0.09780 5.580 2.41e-08 ***
## job_retired 0.52484 0.07732 6.788 1.14e-11 ***
## job_admin 0.12891 0.04714 2.735 0.00624 **
## month_sep 0.31980 0.10908 2.932 0.00337 **
## month_oct 0.74148 0.09603 7.722 1.15e-14 ***
## month_dec 0.74440 0.17933 4.151 3.31e-05 ***
## month_mar 1.12304 0.10851 10.349 < 2e-16 ***
## p_failure -0.57195 0.09453 -6.050 1.44e-09 ***
## p_success 1.29578 0.11245 11.523 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 18640 on 32933 degrees of freedom
## AIC: 18674
##
## Number of Fisher Scoring iterations: 5
summary(lasso_mod)
##
## Call:
## glm(formula = target ~ negative_emp + low_cpi + high_cci + low_euribor +
## university + job_student + job_retired + month_sep + month_oct +
## month_dec + month_mar + p_failure + p_success, family = binomial,
## data = full_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.55607 0.04722 -75.313 < 2e-16 ***
## negative_emp 0.71204 0.09110 7.816 5.43e-15 ***
## low_cpi -0.54579 0.05076 -10.752 < 2e-16 ***
## high_cci 0.67720 0.04432 15.279 < 2e-16 ***
## low_euribor 1.44133 0.08972 16.064 < 2e-16 ***
## university 0.18784 0.04162 4.513 6.39e-06 ***
## job_student 0.50263 0.09623 5.223 1.76e-07 ***
## job_retired 0.48223 0.07610 6.337 2.35e-10 ***
## month_sep 0.33379 0.10882 3.067 0.00216 **
## month_oct 0.75179 0.09601 7.830 4.88e-15 ***
## month_dec 0.74449 0.17938 4.150 3.32e-05 ***
## month_mar 1.13806 0.10864 10.475 < 2e-16 ***
## p_failure -0.41702 0.05916 -7.049 1.80e-12 ***
## p_success 1.48162 0.07376 20.087 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32949 degrees of freedom
## Residual deviance: 18656 on 32936 degrees of freedom
## AIC: 18684
##
## Number of Fisher Scoring iterations: 5
# Compare the models
stepwise_results <- k_fold_mod(data = full_df, target_col = "target", model_formula = stepwise)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
lasso_results <- k_fold_mod(data = full_df, target_col = "target", model_formula = lasso_mod)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
print(stepwise_results)
## $Accuracy_at_best_threshold
## [1] "0.8992 (threshold = 0.54)"
##
## $F1_at_best_threshold
## [1] "0.484 (threshold = 0.2)"
##
## $Sensitivity
## [1] 0.5699
##
## $Specificity
## [1] 0.9006
##
## $AUC
## [1] 0.7755
##
## $AIC
## [1] 16808.24
print(lasso_results)
## $Accuracy_at_best_threshold
## [1] "0.899 (threshold = 0.54)"
##
## $F1_at_best_threshold
## [1] "0.4875 (threshold = 0.2)"
##
## $Sensitivity
## [1] 0.5568
##
## $Specificity
## [1] 0.9079
##
## $AUC
## [1] 0.7758
##
## $AIC
## [1] 16816.71
# Threshold evaluation
probs_stepwise <- predict(stepwise, type = "response")
probs_lasso <- predict(lasso_mod, type = "response")
res_step_05 <- evaluate_threshold(probs_stepwise, target, 0.5)
res_step_02 <- evaluate_threshold(probs_stepwise, target, 0.2)
res_lasso_05 <- evaluate_threshold(probs_lasso, target, 0.5)
res_lasso_02 <- evaluate_threshold(probs_lasso, target, 0.2)
# Combine all results into a list
results_list <- list(
Stepwise_0.5 = res_step_05,
Stepwise_0.2 = res_step_02,
LASSO_0.5 = res_lasso_05,
LASSO_0.2 = res_lasso_02
)
# Convert to a data.frame
results_df <- do.call(rbind, lapply(names(results_list), function(name) {
res <- results_list[[name]]
model <- sub("_.*", "", name)
threshold <- res$Threshold
data.frame(
Model = model,
Threshold = threshold,
Accuracy = res$Accuracy,
F1 = res$F1,
Sensitivity = res$Sensitivity,
Specificity = res$Specificity
)
}))
# Display the result
print(results_df)
## Model Threshold Accuracy F1 Sensitivity Specificity
## 1 Stepwise 0.5 0.8988 0.3114 0.2031 0.9871
## 2 Stepwise 0.2 0.8625 0.4856 0.5762 0.8988
## 3 LASSO 0.5 0.8986 0.3104 0.2026 0.9870
## 4 LASSO 0.2 0.8682 0.4876 0.5566 0.9078
Social and economic context attributes:

- emp.var.rate (Numeric): employment variation rate - quarterly indicator
- cons.price.idx (Numeric): consumer price index - monthly indicator
- cons.conf.idx (Numeric): consumer confidence index - monthly indicator
- euribor3m (Numeric): euribor 3 month rate - daily indicator
- nr.employed (Numeric): number of employees - quarterly indicator